Methada 2020: VESPUCCI exercise n° 1

In this first excercise we will perform few basic operations with VESPUCCI. We will create a Module starting from few genes and then automatically extend it by adding more genes. We will have a look at gene and samples annotations. VESPUCCI is the gene expression database for grapevine and we can access it via its GraphQL interface, called COMPASS. The pyCOMPASS package is a Python package that wraps some functionalities to simplify communication with the COMPASS intereface.

In [74]:
%%html

<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>

Import packages

In [75]:
from pycompass import Connect, Compendium, Module, Platform, Experiment, BiologicalFeature, SampleSet, Sample, Plot, Annotation
from IPython.core.display import display, HTML

Create a Connnect object and pass it the URL to point it to the COMPASS GraphQL endpoint

In [76]:
url = 'http://methada2020.uv.es:5555/graphql'
conn = Connect(url)

Since COMPASS is our interface to (possibily) many compendium, we'll list all of them and select the one we are interested in, i.e. VITIS_VINIFERA

In [77]:
vv_compendium = None
for compendium in conn.get_compendia():
    if compendium.compendium_name == 'vitis_vinifera':
        vv_compendium = compendium
        break

Let's print out the description for this compendium

In [78]:
print(vv_compendium.description)
The Vitis vinifera gene expression compendium

Let's build our module starting from a bunch of top ranking genes from your previous excercise (contrast different cultivars)

In [79]:
gene_names = ['VIT_05s0094g00350','VIT_07s0031g02630','VIT_19s0015g02480','VIT_08s0007g08840','VIT_01s0026g00520','VIT_03s0017g02170','VIT_19s0014g05330',
              'VIT_02s0154g00130','VIT_02s0025g04330','VIT_13s0067g00490','VIT_09s0002g01200','VIT_14s0030g00140','VIT_03s0063g00120','VIT_05s0029g01480',
              'VIT_11s0052g01650','VIT_02s0087g01020','VIT_09s0070g00160','VIT_13s0019g02180','VIT_07s0095g00550','VIT_04s0008g06570','VIT_04s0069g00860',
              'VIT_04s0210g00060','VIT_07s0104g00430','VIT_15s0107g00210','VIT_16s0039g00970','VIT_10s0003g01730','VIT_17s0000g07060','VIT_16s0100g00510',
              'VIT_02s0154g00590']

We can query the compendium with the list of gene names and get a list of BiologicalFeature objects that represents our genes of interest.

In [80]:
genes = BiologicalFeature.using(vv_compendium).get(filter={'name_In': gene_names})

We are now ready to create our first module: the compendium will need the list of genes (BiologicalFeature objects) and the name of the normalization we want to use in order to automatically select the "best" conditions.

Since the module creation process might take few moments we will read it from a file I previously prepared. I commented the code used to create the module and save it to a local file.

In [81]:
#module_1 = Module.using(vv_compendium).create(biofeatures=genes, normalization='legacy')
#module_1.write_to_file('module_1.vsp')
module_1 = Module.read_from_file('module_1.vsp', conn)

A module is a matrix that represent a portion of genes and a portion of conditions of the whole compendium. Let's see its values.

In [82]:
module_1.values
Out[82]:
array([[ 2.0501e+00,  1.9925e+00,  1.9277e+00, ..., -1.1679e+00,
         9.3332e-01,  1.9212e+00],
       [-2.3065e+00, -2.0012e+00, -2.4251e+00, ..., -6.4455e-01,
        -5.5174e+00, -1.4678e+00],
       [        nan,         nan,         nan, ...,  3.9767e+00,
         2.8815e-01,  2.7764e+00],
       ...,
       [ 4.5300e+00,  4.4810e+00,  4.7623e+00, ...,  9.4248e-01,
         5.1820e+00,  5.9341e+00],
       [        nan,         nan,         nan, ...,  3.0111e-02,
        -7.9907e-04,  3.8482e-03],
       [ 2.6310e+00,  2.3106e+00,  2.8507e+00, ..., -2.0859e+00,
         4.7642e-01,  3.0541e+00]])

Now we will plot the heatmap for this module using the Plot object. The plot_heatmap method will return normal HTML + Javascript code so since we are using a web-browser to display thing we will just need to tell Jupyter to interpret the HTML + Javascript code for us

In [83]:
html_hm_module_1 = Plot(module_1).plot_heatmap()

The images are rendered client-size (that is the browser is creating the image since the html_hm_module_1 variable only contains HTML + Javascript code). Plotly is the package used to create the plot and comes with few tools to pan and zoom the image

In [84]:
display(HTML(html_hm_module_1))

One basic functionality users usually want to explore is to find other genes (not part of the initial query) that show an expression pattern similar to the initial query genes for the same conditions.

We can rank BiologicalFeatures or SampleSets based on different scores

The way to do rank BiologicalFeature is first to plot the BiologicalFeature distribution and then select a cut-off based on the co-expression.

In [85]:
vv_compendium.get_score_rank_methods(normalization='legacy')
Out[85]:
{'sampleSets': ['magnitude', 'coexpression'], 'biologicalFeatures': ['std']}
In [86]:
dist_module_1 = Plot(module_1).plot_distribution(plot_type='biological_features_standard_deviation_distribution')
In [87]:
display(HTML(dist_module_1))

Let's create a copy of our module by reading again the module_1 file

In [88]:
module_2 = Module.read_from_file('module_1.vsp', conn)
In [89]:
module_2.values
Out[89]:
array([[ 2.0501e+00,  1.9925e+00,  1.9277e+00, ..., -1.1679e+00,
         9.3332e-01,  1.9212e+00],
       [-2.3065e+00, -2.0012e+00, -2.4251e+00, ..., -6.4455e-01,
        -5.5174e+00, -1.4678e+00],
       [        nan,         nan,         nan, ...,  3.9767e+00,
         2.8815e-01,  2.7764e+00],
       ...,
       [ 4.5300e+00,  4.4810e+00,  4.7623e+00, ...,  9.4248e-01,
         5.1820e+00,  5.9341e+00],
       [        nan,         nan,         nan, ...,  3.0111e-02,
        -7.9907e-04,  3.8482e-03],
       [ 2.6310e+00,  2.3106e+00,  2.8507e+00, ..., -2.0859e+00,
         4.7642e-01,  3.0541e+00]])

Now that we saw the plot, let's create a rank of the BiologicalFeature

In [90]:
rank_genes = vv_compendium.rank_biological_features(module_1, rank_method="std")

And pick the 10 best genes

In [91]:
new_gene_names = rank_genes['ranking']['id'][:10]
In [92]:
new_genes = BiologicalFeature.using(vv_compendium).get(filter={'id_In': new_gene_names})

To add to the second module

In [93]:
module_2.add_biological_features(new_genes)

And then plot the newly created module

In [94]:
html_hm_module_2 = Plot(module_2).plot_heatmap()
In [95]:
display(HTML(html_hm_module_2))

By making the difference between the two modules, fixed by SampleSets (that are the same between modules), we can see just what we added

In [96]:
module_3 = Module.difference(module_2, module_1, sample_sets=False)
In [97]:
html_hm_module_3 = Plot(module_3).plot_heatmap()
In [98]:
display(HTML(html_hm_module_3))

Display the name of one of the newly picked gene

In [99]:
new_genes[6].name
Out[99]:
'VIT_05s0049g00760'

And display the annotation as a RDF-graph

In [100]:
html_annotation_gene = Annotation(BiologicalFeature.using(vv_compendium).get(filter={'name': 'VIT_05s0049g00760'})[0]).plot_network()
In [101]:
display(HTML(html_annotation_gene))

We can also display the SampleSet annotation (or better the annotation of the Samples that compose the SampleSet) as RDF-graph

In [102]:
for s in module_3.sample_sets[0]:
    display(HTML(Annotation(s).plot_network()))

Or as RDF-triples

In [103]:
for s in module_3.sample_sets[0]:
    for t in Annotation(s).get_triples():
        print(' -- '.join(t))
    print()
GSM1010462.ch1 -- whole plant development stage -- 33
GSM1010462.ch1 -- Year of Onset -- Ndfad73f667414b479b23999232fb39cf
Ndfad73f667414b479b23999232fb39cf -- value -- 2005
UO_0000036 -- name -- year
Ndfad73f667414b479b23999232fb39cf -- unit -- UO_0000036
GSM1010462.ch1 -- Year of Onset -- N84c40bd3dab143e88707e7dea521eef2
UO_0000036 -- name -- year
N84c40bd3dab143e88707e7dea521eef2 -- unit -- UO_0000036
N84c40bd3dab143e88707e7dea521eef2 -- value -- 2006
PO_0009087 -- name -- mesocarp
GSM1010462.ch1 -- type -- PO_0009087
GSM1010462.ch1 -- cultivar -- Muscat Hamburg
NCBITaxon_29760 -- name -- Vitis vinifera
GSM1010462.ch1 -- Genotype -- NCBITaxon_29760
GSM1010462.ch1 -- is grown in -- N827400bd2e16439e9e0f09b2e86e47c9
N827400bd2e16439e9e0f09b2e86e47c9 -- type -- ENVO_00000116
N827400bd2e16439e9e0f09b2e86e47c9 -- http://purl.obolibrary.org/obo/GAZ_00000448 -- N3af761ca7ee14193bbab0662e6bbbb59
N3af761ca7ee14193bbab0662e6bbbb59 -- Longitude -- Nf562c28a34a74ba5a02ede8433ff6855
N3af761ca7ee14193bbab0662e6bbbb59 -- Latitude -- N7b3177ae092f4b588653c8a070064996
N3af761ca7ee14193bbab0662e6bbbb59 -- Description -- Torreblanca Spain
UO_0000185 -- name -- degree
N7b3177ae092f4b588653c8a070064996 -- unit -- UO_0000185
N7b3177ae092f4b588653c8a070064996 -- value -- 37.634939
Nf562c28a34a74ba5a02ede8433ff6855 -- value -- -0.890192
UO_0000185 -- name -- degree
Nf562c28a34a74ba5a02ede8433ff6855 -- unit -- UO_0000185

GSM1010480.ch1 -- Year of Onset -- N87ab8e5eaba149e2a811c81004fdc03d
UO_0000036 -- name -- year
N87ab8e5eaba149e2a811c81004fdc03d -- unit -- UO_0000036
N87ab8e5eaba149e2a811c81004fdc03d -- value -- 2005
GSM1010480.ch1 -- Year of Onset -- N4f2ccbf5aef34386b967ab78321c663a
N4f2ccbf5aef34386b967ab78321c663a -- value -- 2006
UO_0000036 -- name -- year
N4f2ccbf5aef34386b967ab78321c663a -- unit -- UO_0000036
PO_0009087 -- name -- mesocarp
GSM1010480.ch1 -- type -- PO_0009087
GSM1010480.ch1 -- cultivar -- Muscat Hamburg
NCBITaxon_29760 -- name -- Vitis vinifera
GSM1010480.ch1 -- Genotype -- NCBITaxon_29760
GSM1010480.ch1 -- is grown in -- N93a1ccf688e14e26baef8c7cf201f81e
N93a1ccf688e14e26baef8c7cf201f81e -- http://purl.obolibrary.org/obo/GAZ_00000448 -- N3aed5cec11b34a9588ab228e3fce3926
N93a1ccf688e14e26baef8c7cf201f81e -- type -- ENVO_00000116
N3aed5cec11b34a9588ab228e3fce3926 -- Latitude -- Nbceb1e597ca4415e831dd14e9138115f
N3aed5cec11b34a9588ab228e3fce3926 -- Description -- Torreblanca Spain
N3aed5cec11b34a9588ab228e3fce3926 -- Longitude -- N761a9a8469954ff6b9a0d3ec3f8f2119
N761a9a8469954ff6b9a0d3ec3f8f2119 -- value -- -0.890192
UO_0000185 -- name -- degree
N761a9a8469954ff6b9a0d3ec3f8f2119 -- unit -- UO_0000185
Nbceb1e597ca4415e831dd14e9138115f -- value -- 37.634939
UO_0000185 -- name -- degree
Nbceb1e597ca4415e831dd14e9138115f -- unit -- UO_0000185
GSM1010480.ch1 -- whole plant development stage -- 38

In [ ]: